Classification: Biological Sciences, Biophysics 

Title: Protein threading by learning 
Authors: 

Iksoo Chang, Marek Cieplak, Ruxandra I. Dima, Amos Maritan, and
Jayanth R. Banavar

Department of Physics, 104 Davey Laboratory, The Pennsylvania State University,
University Park, PA 16802, USA

Department of Physics, Pusan National University, Pusan 609-735, Korea
Institute of Physics, Polish Academy of Science, 02-668 Warsaw, Poland
Institute for Physical Science and Technology and Department of Chemistry and
Biochemistry, University of Maryland, College Park, Maryland 20742, USA

International School for Advanced Studies (SISSA) and Abdus Salam International Center
for Theoretical Physics, Via Beirut 2-4, 34014 Trieste, Italy, and Istituto Nazionale di Fisica
della Materia, Italy

Corresponding author: Jayanth R. Banavar, 104 Davey Laboratory, The Pennsylvania 
State University, University Park, Pennsylvania 16802, phone: 814-863-1089, FAX: 814-865- 
0978, email: jayanth@phys.psu.edu 



pages: 14 
figures: 4 

tables: 1 and 2 supplementary tables 
abstract: 74 words 
paper: 40,435 characters 



Abstract 



Using techniques borrowed from statistical physics and neural networks, we
determine the parameters, associated with a scoring function, that are chosen
optimally to ensure complete success in threading tests in a training set of proteins.
These parameters provide a quantitative measure of the propensities of amino
acids to be buried or exposed and to be in a given secondary structure and are
a good starting point for solving both the threading and design problems.






The principal objective of this paper is a demonstration of the viability of a framework,
based on ideas from statistical physics and neural networks, for attacking the protein
threading problem. Our work points to the difficulty associated with a commonly used
statistical procedure for determining the parameters of a scoring function. We present the
results of threading and design tests and a singular value decomposition (SVD) analysis of
the parameters, which elucidates the interplay between degree of burial and secondary
structure propensities in the folding problem.

The challenge of the protein folding problem (1-5) is to deduce the native state structure
and thence the functionality of a protein from the knowledge of the sequence of amino acids.
The successful completion of the human genome project has heightened interest in this
problem. The information readily available as input consists of the sequences and native
structures of a few thousand proteins (6). Given an entirely new sequence, one needs to have
a sound strategy for determining its native state structure. A simpler problem, threading (7),
relies on the belief that the total number of distinct folds in nature is only a few thousand (8)
and attempts to match the new sequence with the best among a selection of possible native
state structures. (A difficulty associated with threading is that, due to steric constraints, one
may not be able to mount a given sequence on a piece of a native structure of a different
sequence. See, for example, ref. 9.) In order to assess the fit of a given sequence with a
putative native state structure, one might use a coarse grained representation of the amino
acids in a sequence and postulate a scoring function with a simple functional form. Perhaps
the simplest such function is one which characterizes the propensities of the various types of
amino acids to be in different environments:

$$S(s,\Gamma) = \sum_{i} \sum_{m} n(i,m)\, \epsilon(i,m) , \qquad (1)$$

where $S$ is the score function, which is a measure of the match of a sequence, $s$, and target
structure, $\Gamma$, $n(i,m)$ is the number of amino acids of type $i$ in the environment $m$ and
$\epsilon(i,m)$ is the score associated with it (10). For a given amino acid, each of the
$\epsilon(i,m)$'s may be shifted by the same arbitrary constant so that, without loss of
generality, one may set $\sum_{m} \epsilon(i,m) = 0$. The advantages of such an environmental
scoring function over pair-wise interactions between amino acids are its simplicity and the
far greater ease of incorporating gaps in both sequence and in structure.
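
As an illustration, Eq. (1) can be evaluated in a few lines of Python. The following is a minimal sketch rather than the implementation used in this work: the function names are hypothetical, environments are labelled by arbitrary integers, and the toy parameter values are invented for the example (they are not the values of Table I).

```python
from collections import Counter


def environment_counts(sequence, environments):
    """Count n(i, m): how often amino acid i sits in environment m."""
    if len(sequence) != len(environments):
        raise ValueError("sequence and structure must have the same length")
    return Counter(zip(sequence, environments))


def score(sequence, environments, eps):
    """Evaluate S(s, Gamma) = sum_i sum_m n(i, m) * eps(i, m)."""
    counts = environment_counts(sequence, environments)
    return sum(n * eps[key] for key, n in counts.items())


# Toy parameters (hypothetical values chosen only for illustration).
eps = {("C", 0): -1.0, ("C", 1): 1.0, ("K", 0): 2.0, ("K", 1): -2.0}

# Sequence "CKC" mounted on a structure whose sites have environments 0, 1, 0.
print(score("CKC", [0, 1, 0], eps))  # -4.0
```

Because the score depends on the structure only through the counts $n(i,m)$, two structures with the same environment composition for a given sequence receive the same score.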

Our focus is on determining the score quantifying the match of a sequence to a putative 
native state structure, the most common approach for which utilizes statistical considera- 
tions (11-13), based on counting the number of amino acids in a given environment in the 
native state. Pioneering work by Bowie et al. (10) has shown that a simple statistically 
based approach with an environmental score leads to excellent results for the inverse folding 
problem. 

Our studies used a training set of 387 proteins (see Table I in Supplementary Information)
from the PDBselect (6,14), consisting of sequences varying in length from 44 to 1017
with low sequence homology and covering many different 3D folds according to the SCOP
classification (15). Additional criteria used in selecting the proteins in the training set were:
a) the protein structure was obtained through X-ray crystallography, b) the structure was
monomeric, c) the determined structure missed no more than two amino acids. The same
criteria were used to obtain a test set of 213 distinct proteins (Table II in Supplementary
Information) with lengths ranging between 54 and 869. For each structure, we used a simple
environmental classification which consists of the local secondary structure ($\alpha$-helix,
$\beta$-strand or other) and the exposed area, evaluated as the ratio between the accessible
area of each amino acid, X, of its native sequence (having this structure as its native state)
and the corresponding area in a Gly-X-Gly extended structure. The values of the exposed
area were divided into three categories of small, medium and large exposure, corresponding
to < 10%, 10-50%, and > 50% respectively. Thus the scoring function consists of nine
parameters for each amino acid, one for each of the nine environments in which it might be
found.
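
The nine-environment classification described above can be sketched as follows. This is an illustrative sketch only: the function names, the string labels, the ordering of the nine environments, and the treatment of the exact 10% and 50% boundaries are assumptions, and the accessible and reference areas are taken as given inputs.

```python
SECONDARY = ("helix", "strand", "other")
EXPOSURE = ("small", "medium", "large")


def exposure_class(relative_area):
    """Map the exposed-area ratio to small (<10%), medium (10-50%) or large (>50%)."""
    if relative_area < 0.10:
        return "small"
    if relative_area <= 0.50:
        return "medium"
    return "large"


def environment_index(secondary, accessible_area, reference_area):
    """Return an index 0..8 combining secondary structure and exposure class.

    reference_area is the area of the same residue X in an extended Gly-X-Gly chain.
    """
    relative = accessible_area / reference_area
    return 3 * SECONDARY.index(secondary) + EXPOSURE.index(exposure_class(relative))


print(environment_index("helix", 5.0, 100.0))   # 0: buried helix site
print(environment_index("strand", 30.0, 100.0)) # 4: strand, medium exposure
print(environment_index("other", 80.0, 100.0))  # 8: exposed loop site
```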



Materials and Methods

We begin by applying the ideas of Bowie et al. (10) to the threading problem. The
statistical score $\epsilon_s(i,m)$ associated with amino acid $i$ in an environment $m$ is
readily deduced using the expression

$$\epsilon_s(i,m) = -\ln \frac{P(i,m)}{P(i)} , \qquad (2)$$

where $P(i,m)$ is the probability of finding an amino acid of type $i$ in the environment of
type $m$ and $P(i)$ is the probability of finding an amino acid of type $i$ in any environment.
Both $P(i,m)$ and $P(i)$ are determined from a knowledge of the sequences and native state
structures of the proteins in our training set. In order to assess the quality of the extracted
scores, we carried out threading tests on all but the largest protein in the training set itself.
Each protein sequence was mounted on its own native state structure and on every fragment
(of the correct length, chosen without insertions and deletions) of all the larger proteins. The
exposed area for the amino acid mounted on a fragment was assumed to be the same as that in
the whole protein from which the fragment was extracted. As we shall see later, this may be
a poor approximation when the size of the fragment is much smaller than the whole protein.
In each case, Eq. (1) was used to determine the score. While the technique is simple, the
results of gapless threading tests are only moderate: the native state structure is correctly
recognized for 69% of the proteins. In a recent paper, Baud and Karlin (16) considered 418
proteins and determined the frequencies of occurrence of the twenty amino acids in nine
environments which were defined in a way similar to ours. We have converted their
frequencies into statistical scores (which turn out to be similar to the statistical scores
derived from our training set), using Eq. (2), and find 54 failures in our set of 213
proteins. This moderate performance may be due to the fact that the form of the scoring
function is too simple. Support for this comes from earlier work which has pointed out the
difficulty of determining the optimal interactions that stabilize the native state of even a
single protein (crambin) with a more complex scoring function involving 210 pairwise
interactions (17). An alternative possibility, that the statistical approach is flawed (18),
would be of more serious concern because such statistical schemes are commonly used in the
protein folding problem.
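
A statistical score of this kind can be estimated from simple counts. The sketch below assumes a negative log-odds form, $-\ln[P(i,m)/P(i)]$, with $P(i,m)$ interpreted as the fraction of sites of environment $m$ occupied by amino acid $i$; both the functional form and this interpretation are assumptions made for illustration, as are the function name and the toy data.

```python
import math
from collections import Counter


def statistical_scores(sites):
    """Estimate log-odds scores from (amino_acid, environment) pairs of native structures.

    Assumed form: score(i, m) = -ln[ P(i | m) / P(i) ], so that over-represented
    pairs receive negative (favourable) scores. Pairs never observed are omitted.
    """
    pair_counts = Counter(sites)
    total = sum(pair_counts.values())
    aa_counts = Counter()
    env_counts = Counter()
    for (aa, env), n in pair_counts.items():
        aa_counts[aa] += n
        env_counts[env] += n
    scores = {}
    for (aa, env), n in pair_counts.items():
        p_aa_given_env = n / env_counts[env]
        p_aa = aa_counts[aa] / total
        scores[(aa, env)] = -math.log(p_aa_given_env / p_aa)
    return scores


# Toy data: C is mostly found in environment 0, K mostly in environment 1.
sites = [("C", 0)] * 3 + [("C", 1)] + [("K", 0)] + [("K", 1)] * 3
scores = statistical_scores(sites)
print(scores[("C", 0)] < 0 < scores[("C", 1)])  # True
```

Note that such scores use only native-state statistics; no information about competing structures enters the estimate.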

We turn to a demonstration that an alternative strategy based on ideas originating in
statistical physics and neural networks provides a powerful framework for tackling the
threading problem. Following the pioneering work of Friedrichs and Wolynes (19) and
especially Goldstein et al. (20), the basic idea is to postulate the form of a scoring function
and to choose its parameters to ensure that the true native states of proteins with known
structures (the learning set) correspond to better (lower) scores than when the sequences are
housed in competing decoy conformations (17-28). An important advantage of this procedure
is that it can be used to verify whether the chosen form of the scoring function is equal to
the task or not. Indeed, one may start with the simplest form of the scoring function and
systematically expand the parameter space until the optimal interactions are learned. The
statistical procedure considers only proteins and their native state structures, whereas the
learning procedure has information on competing structures as well. Our scheme is similar in
spirit to that of previous work, with the important differences that we consider an
environmental scoring function instead of a pairwise contact potential and that we optimize
the energy gap without any normalization.

The total number of inequalities is over 13 million, making the problem technically
difficult (one obtains the inequalities for each sequence in the training data set by
considering as decoys all pieces of the native state structures of longer proteins in the
training set). For a given protein, each decoy leads to a linear inequality of the form

$$\sum_{i} \sum_{m} \left[ n^{D}(i,m) - n^{N}(i,m) \right] \epsilon(i,m) > 0 ,$$

where $n^{D}(i,m)$ is the number of amino acids of type $i$ found in the environment $m$ in the
given decoy and $n^{N}(i,m)$ is the corresponding number in the native state. The perceptron
procedure is a simple technique based on neural networks for simultaneously solving a set of
such linear inequalities (29). We used this procedure to choose the 180 parameters optimally,
so that the worst inequality (among the more than 13 million) was satisfied as well as
possible and threading tests on the training set were 100% successful.
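
The flavor of such a learning step can be conveyed by a small perceptron-style sketch. Each inequality is represented by a vector of count differences (decoy minus native) over the parameters, and the weight vector is repeatedly corrected using the currently worst-satisfied inequality. This is a simplified illustration in the spirit of ref. 29, not the procedure actually used here; the function name, the fixed margin, and the toy patterns are all assumptions.

```python
def learn_parameters(patterns, margin=1.0, max_updates=10000):
    """Find w with dot(w, x) >= margin for every pattern x, if one exists.

    patterns: list of equal-length vectors, each the difference
    n_decoy(i, m) - n_native(i, m) flattened over all (i, m).
    """
    w = [0.0] * len(patterns[0])
    for _ in range(max_updates):
        overlaps = [sum(wi * xi for wi, xi in zip(w, x)) for x in patterns]
        worst = min(range(len(patterns)), key=overlaps.__getitem__)
        if overlaps[worst] >= margin:
            return w  # every inequality is satisfied with the requested margin
        # Perceptron update: move w towards the worst-satisfied pattern.
        w = [wi + xi for wi, xi in zip(w, patterns[worst])]
    raise RuntimeError("no solution found within max_updates")


# Three toy inequalities in a two-parameter space.
patterns = [(1.0, -0.5), (0.5, 1.0), (1.0, 0.2)]
w = learn_parameters(patterns)
print(w)  # [1.5, 0.5]
```

If the inequalities cannot all be satisfied, the loop never reaches the margin, which is one way in which an inadequate functional form of the scoring function reveals itself.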

Results 

We describe the results of several tests and a biological interpretation of these parame- 
ters: 

Learning procedure versus statistical approach: 

Figure 1 shows a plot of the parameters determined using the statistical approach versus 
those deduced by the learning procedure. The poor correlation is consistent with the quali- 
tatively different performance levels in threading. It underscores the fundamental difficulties 
of the statistical approach and points to the advantage of learning the optimized parameters 
in a systematically expanded parameter set. 

Threading tests: 

The couplings $\epsilon_{387}(i,m)$, obtained by learning the native states of the 387
proteins in the training set, were subjected to threading on the test set containing 213
distinct proteins (Table II in Supplementary Information) and the decoys obtained from their
native state structures. In contrast to the performance of the statistical parameters, for
which one is unable to correctly recognize the native states of 76 of the 213 proteins, the
number of failures when one uses the learned parameters is 23. The failed proteins are modest
in size and have sequence lengths ranging between 54 and 131. The ranks of the native states
of the failed proteins, defined as the number of better performing decoys, are plotted in
Figure 2 as a function of the sequence length for both sets of parameters (note the
dramatically different scales of the y-axis). For the poorest performer using the perceptron
based method, there are 102 decoys (out of 37617) that perform better than the native state
(protein 1ABQ of length 56), while the corresponding number for the statistically derived
parameters is 29424 (out of 31436 decoys for protein 1VQB of length 86). We have checked that
around half of the failures are spurious for the case of the learned parameters and arise
because the exposed areas for the winning decoy, which is a piece of the native state
structure of a longer protein, are quite different from those determined for the whole
protein. This effect of an inaccurate assignment of the exposed area is strong only for small
sized proteins. The remaining failures (a total of 5%) are likely due to the identification of
genuine competitors to the native state or arise because the winning decoy is not a viable
structure for the sequence under consideration (9).
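
The rank used above can be computed with a sliding window over each candidate structure. The following is a minimal sketch of gapless threading under simplifying assumptions: structures are given as lists of environment labels, ties are not counted as failures, and the function name and toy parameters are hypothetical.

```python
def thread_rank(sequence, native_envs, structures, eps):
    """Return (rank, number_of_decoys) for gapless threading.

    rank counts the decoys that score strictly lower (better) than the native state.
    """
    def score(envs):
        return sum(eps[(aa, env)] for aa, env in zip(sequence, envs))

    length = len(sequence)
    native = score(native_envs)
    rank = decoys = 0
    for envs in structures:
        # Slide a window of the sequence length along each candidate structure.
        for start in range(len(envs) - length + 1):
            decoys += 1
            if score(envs[start:start + length]) < native:
                rank += 1
    return rank, decoys


eps = {("A", 0): -1.0, ("A", 1): 1.0, ("B", 0): 1.0, ("B", 1): -1.0}
print(thread_rank("AB", [0, 1], [[1, 0, 1, 0]], eps))  # (0, 3)
print(thread_rank("AA", [0, 1], [[0, 0, 1]], eps))     # (1, 2)
```

A rank of zero corresponds to a successful threading test; structures shorter than the sequence contribute no decoys.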

We also tested the $\epsilon_{387}$ parameters on all 600 proteins (Tables I and II in
Supplementary Information) and the decoys obtained using all 600 native state structures.
There were 57 failures, whereas the statistically derived parameters resulted in 209 failures.
We used the perceptron procedure (29) to learn the scoring parameters in order to ensure that
the native states of all the proteins in the training and test sets were recognized with 100%
success and the energies of all decoys were pushed up as much as possible compared to the
native state energies. In the rest of the paper, we will use this refined set of optimal
parameters $\epsilon(i,m)$ (Table I) to carry out our further studies. Note that the sum of the
first nine entries of each row in Table I is equal to zero and the sum of the squares of all
such 180 entries has been chosen to be 180.
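
The two conventions just mentioned (zero row sums and a fixed sum of squares) can be imposed on any parameter set without changing the outcome of gapless threading, since a per-amino-acid shift adds the same constant to the score of a given sequence on every structure and a positive rescaling preserves the ordering of scores. A minimal sketch follows, with a hypothetical function name and toy input; it assumes the parameters are not all equal within every row.

```python
import math


def normalize(rows):
    """Shift each row to zero sum, then rescale so the sum of squares equals the entry count."""
    centered = [[x - sum(row) / len(row) for x in row] for row in rows]
    count = sum(len(row) for row in centered)
    norm_sq = sum(x * x for row in centered for x in row)
    scale = math.sqrt(count / norm_sq)
    return [[x * scale for x in row] for row in centered]


# Two amino acids with three environments each (toy values).
params = normalize([[1.0, 2.0, 3.0], [0.0, 0.0, 6.0]])
print([round(sum(row), 12) for row in params])  # row sums vanish (up to rounding)
print(round(sum(x * x for row in params for x in row), 12))  # 6.0
```

For the 20 x 9 parameter table the same normalization gives a sum of squares of 180.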

The native state of crambin (1CRN), which was not part of the training set, is recognized
in threading. This result is encouraging because of earlier difficulties in learning pairwise
parameters for this protein (17). It should be noted, however, that the protein 1CBN, which
differs from 1CRN by a single amino acid mutation, was present in the basic learning set of
387 proteins.

As a further test, we selected 26 globin proteins from the RCSB website
(http://www.rcsb.org/pdb/) which were in the deoxy form, were not mutated, and whose
structures are well resolved. Strikingly, 23 of the 26 proteins correctly picked their
own native state from among the millions of decoy conformations obtained from the
fragments of the 600 proteins in the training and test sets described previously. For the 3
other cases, fragments from the globin family were picked as the best structure. Indeed, the
scores of the globin proteins on fragments of other globin structures were generally lower
than on fragments of structures of unrelated proteins, underscoring the quality of our scoring
function.

Biological interpretation of learned parameters: 

Let us begin with a geometrical picture of $\epsilon(i,m)$, considered to be twenty vectors
of nine components each. For a given amino acid $i$, the components of the nine-dimensional
vector, labelled by the index $m$, capture the propensities of that amino acid to be in each of
nine environments. Each environment may be thought of as representing an axis in an orthogonal
9-dimensional space. Singular value decomposition (28, 31) affords a simple prescription for
dimensional reduction by the optimal choice of a new set of orthogonal axes. In this new
reference frame, the original vectors span a lower-dimensional space and the axes may be
conveniently rank-ordered in importance.

The SVD theorem (31) states that the 20 x 9 (non-square) matrix $\epsilon$ can be written as

$$\epsilon = Y V^{T} , \qquad (3)$$

where $Y$ is a 20 x 9 matrix and $V$ is a 9 x 9 matrix. The superscript $T$ denotes the
transpose matrix. The matrix $Y$ is given by $Y = U \Sigma$, where $\Sigma$ is a 20 x 9
matrix whose elements are all zero except for the diagonal terms,
$\Sigma_{n,n} = \sigma_n$, $n = 1, \ldots, 9$. These diagonal terms are equal to the square
roots of the common eigenvalues of $\epsilon \epsilon^{T}$ and $\epsilon^{T} \epsilon$. The
$\sigma_n$'s are called singular values and are assumed to be rank ordered so that $\sigma_1$
is the largest. Here, they are: 10.59, 5.02, 3.98, 3.42, 2.57, 2.09, 1.77, 0.95, and 0.0.
The columns of $V$, denoted by $v_k$, are the eigenvectors corresponding to the rank ordered
eigenvalues of the matrix $\epsilon^{T} \epsilon$, and the columns of the 20 x 20 matrix $U$,
denoted as $u_k$, $k = 1, \ldots, 20$, are determined by the formula
$u_k = \frac{1}{\sigma_k} \epsilon v_k$ (when the singular values are non-zero; the other
cases are irrelevant for the reconstruction of the $\epsilon$ parameters). The result of
the SVD transformation is that $\epsilon(i,m)$ may now be represented as a sum of
contributions that diminish in an overall sense as one considers smaller singular values. The
$n$'th contribution is given by $y^{(n)}(i)\, v^{(n)}(m)$, where the first factor depends on
the amino acid and the second on the environmental index. Thus the $v^{(n)}$ are the new
orthogonal and normalized directions, or modes, in the space of environments.
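
A quick arithmetic check on the quoted singular values is possible because the squared singular values of a matrix add up to the sum of the squares of its entries, which for these parameters has been normalized to 180. The short sketch below (variable and function names are hypothetical) also shows how much of that total the leading modes carry.

```python
singular_values = [10.59, 5.02, 3.98, 3.42, 2.57, 2.09, 1.77, 0.95, 0.0]

squares = [s * s for s in singular_values]
total = sum(squares)


def captured_fraction(k):
    """Fraction of the total sum of squares carried by the k leading modes."""
    return sum(squares[:k]) / total


print(round(total, 2))                 # 179.89, i.e. 180 within the quoted rounding
print(round(captured_fraction(1), 2))  # 0.62
print(round(captured_fraction(3), 2))  # 0.85
```

The vanishing ninth singular value is consistent with the constraint that the nine scores of each amino acid sum to zero, which removes one direction from the space of environments.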

Figure 3 shows the three most dominant contributions, corresponding to the three largest
singular values. Each contribution is displayed in two panels. The upper panels show
the mode as a function of the nine environmental parameters. The lower panels show the
corresponding amplitudes $y^{(n)}(i)$, with the amino acids ordered so that $y^{(n)}$
increases monotonically.

The first mode is dominant for 13 amino acids: C, F, I, V, H, S, T, N, P, Q, E, R,
and K. The second mode is the leader for W, M, Y, and D and the third for L and A.
The remaining amino acid, G, is dominated by the fifth mode. The first mode provides
the overall dominant behavior and strongly distinguishes between the buried and exposed
environments in a monotonic way regardless of the secondary structure. Its amplitude
$y^{(1)}$ allows one to arrange the amino acids into buried and exposed groups depending on
whether it is large and positive or large and negative. One may further subdivide the two
basic groups of buried (B) and exposed (E) amino acids into subgroups: B1, B2, B3, E1, and
E2. The division is illustrated in Figure 3 and corresponds to occurrences of more rapid
variations in $y^{(1)}$ as one moves from one amino acid to the next. The key point is that
the amino acids in B1 have a strong tendency to be buried, the charged amino acid K in E2 has
a strong tendency to be exposed, and most of the amino acids are more sensitive to the degree
of burial than to other considerations. This tendency for burial is usually associated with
hydrophobicity in the protein folding problem (32-34). The hydrophobic amino acids F, I, V,
L, and A do belong to group B, but this group also contains polar amino acids. Cysteine, C,
shows the strongest propensity to be buried. It should be noted that a pair of C's may form a
strong contact by establishing a disulfide bridge. Of the 896 C-C contacts generated in
our study of 600 proteins, 402 had both C's buried whereas in only four cases were both
of the C's exposed (independent of the secondary structure). This tendency alone yields a
high statistical score for C being buried. (Note that 37% of the structural sites of the 600
proteins are classified as buried, 40% as medium, and 33% as exposed.) The learned score
is even further accentuated because most of the decoys correspond to C being not buried
and the stability of the native state with respect to decoys is enhanced by such an
adjustment.

The remaining modes break the symmetry between the secondary structures. The second
mode is neutral to $\alpha$ and favors (disfavors) $\beta$ (loop) when the coefficient
$y^{(2)}$ is negative. It shows a strong preference for amino acids, such as W, with a large
negative $y^{(2)}$ to be in a $\beta$-strand with a large exposed area and for amino acids,
such as D, with a large positive $y^{(2)}$ to be in loops with a large exposed area. The third
mode introduces a preference for C, F, K, etc. to be in $\beta$-strands with medium exposure
and for L, P, and A, etc. to stay either in exposed $\beta$-strands or in buried loops and
avoid exposed helices.

Protein design: We turn now to an extension of our studies to protein design or the
inverse folding problem. In analogy with equilibrium statistical mechanics, the probability
that a sequence $s$ is housed in a structure $\Gamma$ is given by (35-38)

$$P(s,\Gamma) = \frac{e^{-S(s,\Gamma)/T}}{\sum_{\Gamma'} e^{-S(s,\Gamma')/T}} = e^{-[S(s,\Gamma) - F(s)]/T} , \qquad (4)$$

where $T$, here, is a fictitious temperature, the score $S$ has been assumed to play a role
analogous to the energy and $F$, the free score, plays the role of the free energy. The key
point is that in the limit of $T \to 0$ and when $\Gamma$ is the native state structure of
$s$, $P \to 1$. In this limit, therefore, the "free score", which is a function of the
sequence alone, approaches the score of the sequence in its native state. The last column of
Table I shows the average contribution to the native state scores, $S_i$, from each type of
amino acid in the various environments. It is defined by
$S_i = \frac{1}{N_i} \sum_{k=1}^{N_i} \epsilon(i, m(k))$, where the sum is over the $N_i$
occurrences of amino acid $i$ in the native states of all 600 proteins in the training and
test sets and $m(k)$ denotes the environment of the $k$-th occurrence. The zero "temperature"
free score of a sequence may then be readily deduced without any knowledge of the structure,
by adding up these contributions for the amino acids in the sequence. Figure 4 shows a plot
of the native state score versus the sequence free score for all 600 proteins. The latter,
which has no structure dependent information, provides a reasonable approximation to the
actual native state score. We have verified that both are linearly proportional to the
protein length and that, for the longer proteins, the native state score is somewhat higher
than the free score due to the increasing tendency toward frustration as the sequence length
increases. For design purposes, the free score provides a measure of the score one is
entitled to expect in a typical native state structure, and the lower the score in the target
native state structure with reference to the free score, the better is the design.
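
The quantities entering this argument can be sketched numerically. The snippet below assumes the standard statistical-mechanics definition of the free score, $F = -T \ln \sum_{\Gamma'} e^{-S(s,\Gamma')/T}$, evaluated over a finite list of candidate structures, and approximates the zero-temperature free score of a sequence by summing per-residue averages $S_i$. Function names and the toy score list are hypothetical; the two $S_i$ values used are the entries for K and R in the last column of Table I.

```python
import math


def free_score(scores, T):
    """F = -T ln sum_Gamma exp(-S/T), shifted by the minimum for numerical stability."""
    s_min = min(scores)
    z = sum(math.exp(-(s - s_min) / T) for s in scores)
    return s_min - T * math.log(z)


def probability(s, scores, T):
    """Boltzmann-like probability exp(-(S - F)/T) of a structure with score s."""
    return math.exp(-(s - free_score(scores, T)) / T)


def sequence_free_score(sequence, average_contribution):
    """Structure-independent estimate: add up S_i over the residues of the sequence."""
    return sum(average_contribution[aa] for aa in sequence)


scores = [-5.0, -1.0, 0.0]  # toy scores; the first plays the role of the native state
print(probability(-5.0, scores, 0.05) > 0.999)   # True: P -> 1 as T -> 0
print(abs(free_score(scores, 0.05) + 5.0) < 1e-6)  # True: F -> native score
print(round(sequence_free_score("KR", {"K": -1.11, "R": -0.38}), 2))  # -1.49
```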

Stability of cold shock proteins: We used the learned parameters to provide a molecular
interpretation of the different thermal stabilities of a pair of cold shock proteins (39), one
from the mesophilic Bacillus subtilis (Bs-CspB: 1CSP) and the other from the thermophilic
Bacillus caldolyticus (Bc-Csp: 1C9O). The former has a score of -34.80 in the native state,
whereas the latter is more stable with a score of -41.64. More strikingly, the free scores are
-28.64 and -27.97 respectively, underscoring the much better design of the thermophilic
protein. We also used the conformation space of all decoys to estimate the "heat capacity" of
the two proteins as a function of temperature. The heat capacity, which is a measure of the
fluctuations in the score (viewed as an energy), shows a peak as a function of the temperature
in both cases. The peak temperature, which is a measure of the folding transition temperature
of the protein, is higher for 1C9O than for 1CSP and reflects the better thermal stability of
1C9O, in accord with the experimentally observed behavior (39).
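
One way to estimate such a "heat capacity" from a set of decoy scores is the usual fluctuation formula, $C(T) = (\langle S^2 \rangle - \langle S \rangle^2)/T^2$, with Boltzmann-like weights $e^{-S/T}$. The text does not specify the estimator, so the formula, the function name, and the toy spectrum below are assumptions made for illustration.

```python
import math


def heat_capacity(scores, T):
    """Variance of the score under weights exp(-S/T), divided by T^2."""
    s_min = min(scores)
    weights = [math.exp(-(s - s_min) / T) for s in scores]
    z = sum(weights)
    mean = sum(w * s for w, s in zip(weights, scores)) / z
    variance = sum(w * (s - mean) ** 2 for w, s in zip(weights, scores)) / z
    return variance / (T * T)


# Toy spectrum: one low-lying "native" score and ten degenerate decoys.
scores = [-1.0] + [0.0] * 10
low, mid, high = (heat_capacity(scores, T) for T in (0.05, 0.3, 10.0))
print(mid > low and mid > high)  # True: the curve peaks at an intermediate T
```

A larger gap between the native score and the bulk of the decoys shifts the peak to a higher temperature, which is the sense in which the peak position tracks thermal stability.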

Conclusion 

In summary, we have shown that a straightforward learning scheme leads to the
determination of excellent environmental parameters which can be used in simple threading
tests. Our results point to the danger of employing statistical procedures for estimating these
values. The learned parameters capture information on the environments in the competing
structures in addition to that in the native state structures and allow for a stabilization
of the native state with respect to decoy structures. Our procedure validates the notion
that, in the simple cases we have studied here, a simple environmental scoring function
is sufficient for capturing the essential features of protein threading. Our method has the
distinct advantage of ease of expanding the parameter space and opens up the possibility
of using the scoring parameters determined here as a starting point for learning the penalty
parameters characterizing insertion and deletion.

This work was supported by grants from NASA, INFM and MURST (Italy), the Donors 
of the Petroleum Research Fund administered by the American Chemical Society, PNU 
research fund and KBN (grant number 2P03B-146-18). 






1. Anfinsen, C. (1973) Science 181, 223-230.

2. Wolynes, P. G., Onuchic, J. N. & Thirumalai, D. (1995) Science 267, 1619-1620.

3. Dill, K. A. & Chan, H. S. (1997) Nature Struct. Biol. 4, 10-19.

4. Fersht, A. R. (1998) Structure and mechanism in protein science: A guide to enzyme
catalysis and protein folding. Freeman, New York.

5. Baker, D. A. (2000) Nature 405, 39-42.

6. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H.,
Shindyalov, I. N. & Bourne, P. E. (2000) Nucl. Acid. Res. 28, 235-242.

7. Jones, D. T., Taylor, W. R. & Thornton, J. M. (1992) Nature (London) 358, 86-89.

8. Chothia, C. (1992) Nature 357, 543-544.

9. Ramachandran, G. N. & Sasisekharan, V. (1968) Adv. Protein Chem. 23, 283-437.

10. Bowie, J., Lüthy, R. & Eisenberg, D. (1991) Science 253, 164-170.

11. Tanaka, S. & Scheraga, H. A. (1976) Macromolecules 9, 945-950.

12. Miyazawa, S. & Jernigan, R. L. (1985) Macromolecules 18, 534-552.

13. Zhang, C. & Kim, S. (2000) Proc. Natl. Acad. Sci. 97, 2550-2555.

14. Hobohm, U. & Sander, C. (1994) Protein Sci. 3, 522-524.

15. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995) J. Mol. Biol. 247,
536-540.

16. Baud, F. & Karlin, S. (1999) Proc. Natl. Acad. Sci. 96, 12494-12499.

17. Vendruscolo, M., Najmanovich, R. & Domany, E. (1999) Phys. Rev. Lett. 82, 656-659.

18. Dima, R. I., Banavar, J. R. & Maritan, A. (2000) Protein Sci. 9, 812-819.

19. Friedrichs, M. S. & Wolynes, P. G. (1989) Science 246, 371-373.

20. Goldstein, R., Luthey-Schulten, Z. A. & Wolynes, P. G. (1992) Proc. Natl. Acad. Sci.
89, 9029-9033.

21. Koretke, K. K., Luthey-Schulten, Z. A. & Wolynes, P. G. (1996) Protein Sci. 5,
1043-1059.

22. Maiorov, V. N. & Crippen, G. M. (1992) J. Mol. Biol. 227, 876-888.

23. Mirny, L. A. & Shakhnovich, E. I. (1996) J. Mol. Biol. 264, 1164-1179.

24. Clementi, C., Maritan, A. & Banavar, J. R. (1998) Phys. Rev. Lett. 81, 3287-3290.

25. Dima, R. I., Settanni, G., Micheletti, C., Banavar, J. R. & Maritan, A. (2000) J. Chem.
Phys. 112, 9151-9166.

26. Vendruscolo, M., Mirny, L. A., Shakhnovich, E. I. & Domany, E. (2000) Proteins:
Structure, Function, and Genetics 41, 192-201.

27. Tobi, D. & Elber, R. (2000) Proteins: Structure, Function, and Genetics 41, 40-46.

28. Tobi, D., Shafran, G., Linial, N. & Elber, R. (2000) Proteins: Structure, Function, and
Genetics 40, 71-85.

29. Krauth, W. & Mezard, M. (1987) J. Phys. A 20, L745-L752.

30. Park, B. & Levitt, M. (1996) J. Mol. Biol. 258, 367-392.

31. Watkins, D. S. (1991) Fundamentals of Matrix Computations. Wiley, New York.

32. Kauzmann, W. (1959) Adv. Protein Chem. 14, 1-63.

33. Dill, K. A. (1990) Biochemistry 29, 7133-7155.

34. Kamtekar, S., Schiffer, J. M., Xiong, H. Y., Babik, J. M. & Hecht, M. H. (1993) Science
262, 1680-1685.

35. Deutsch, J. M. & Kurosky, T. (1996) Phys. Rev. Lett. 76, 323-326.

36. Seno, F., Vendruscolo, M., Maritan, A. & Banavar, J. R. (1996) Phys. Rev. Lett. 77,
1901-1904.

37. Dima, R. I., Banavar, J. R., Cieplak, M. & Maritan, A. (1999) Proc. Natl. Acad. Sci.
96, 4904-4907.

38. Micheletti, C., Maritan, A. & Banavar, J. R. (1999) J. Chem. Phys. 110, 9730-9738.

39. Perl, D., Mueller, U., Heinemann, U. & Schmid, F. (2000) Nature Struct. Biol. 7,
380-383.






Table Captions 

Table I: Table of $\epsilon$, the nine environmental scores for each amino acid. Large negative
values indicate a strong preference for the particular environment whereas large positive
values indicate an aversion. The last column shows $S_i$, which is a measure of the average
contribution of each amino acid to the native state score and provides an estimate of the
expectation of the contribution of a given amino acid to the native state score.






Table I. 180 Environmental Scores

                      $\alpha$                  $\beta$                   Other
Amino Acid     Small    Med.   Expo.   Small    Med.   Expo.   Small    Med.   Expo.   $S_i$
CYS (C)        -1.29    0.07    1.81   -1.78   -0.83    3.63   -1.24   -0.85    0.49   -1.0?
PHE (F)        -0.90   -0.35    2.66   -1.77   -1.02    1.51   -0.26   -0.28    0.74   -0.73
TRP (W)         0.41    0.32    1.?4   -1.18   -1.00   -1.02    0.57    0.50    0.91   -0.07
ILE (I)        -0.?0   -0.2?    0.??   -0.2?   -0.?9    0.?1   -1.05    0.56    0.92   -0.29
VAL (V)         0.42    0.0?   -0.12       ?   -0.64    0.89   -0.28    0.57    0.58   -0.31
MET (M)        -0.2?   -0.3?    0.65   -0.52    0.71    1.2?       ?   -0.77   -0.47   -0.30
LEU (L)            ?   -0.1?    0.09   -0.32    0.83   -0.7?       ?    0.77    0.41   -0.10
GLY (G)         0.??    1.1?    0.73   -0.05    0.16    0.14   -0.49   -0.95   -1.06   -0.48
TYR (Y)            ?    0.83   -0.06   -0.42   -1.18   -0.23    0.23    0.08    0.63    0.00
ALA (A)        -0.40   -0.05   -0.13    0.27    0.50   -0.15   -0.23    0.35   -0.25   -0.06
HIS (H)         1.05   -0.60   -0.82    0.62    0.56    0.14   -0.29   -0.08   -0.57   -0.09
ASP (D)        -0.29   -0.79   -0.90    1.31    0.93    1.32    0.59   -0.99   -1.17   -0.60
SER (S)        -0.31   -0.01   -0.98    0.48    0.78   -0.75    1.00   -0.32   -0.10    0.03
THR (T)         0.80    0.49   -0.46    0.55   -0.50   -0.80    0.74   -0.34   -0.48    0.01
ASN (N)         0.67   -0.66   -0.66    1.34    0.60   -0.06    0.55   -0.48   -1.30   -0.39
PRO (P)         2.35   -0.28   -0.88    1.32    1.03   -0.30   -1.02   -0.62   -1.61   -0.65
GLN (Q)         1.74   -0.84   -1.24    0.94   -0.87   -1.07    1.32    0.01    0.01   -0.26
GLU (E)         0.83   -0.81   -1.28    1.67   -0.21   -0.67    1.60    0.04   -1.16   -0.53
ARG (R)         2.29   -0.80   -1.37    1.37   -1.16   -1.35    1.82    0.13   -0.94   -0.38
LYS (K)         1.20   -1.13   -1.77    4.32   -1.43   -1.91    2.38   -0.32   -1.35   -1.11






Supplementary Table I. PDB codes for 387 learning proteins



1CII 1KCW 4HB1 1DHX 1BLE 1SQC 1AB4 1PKP 1FIY 1RGS
1TDJ 1FSZ 8LDH 6ICD 1AEP 1ANV 5PTD 1AOI 1B6E 1SIG
1DIV 1KXU 1BVB 1BAJ 1A6F 1LRV 2ITG 1CBY 1LXA 1914
1CC5 1AOP 1OHK 1JON 1PJR 1A7J 1AUA 1GLN 1IHP 1HLB
1ZAP 2STV 1A17 1RDR 1RLW 1UBY 2SAS 1BCO 1ASY 1AX8
2LIV 1ANS 2OMF 1A41 1C25 1AK5 1AQT 1AJ6 2FXB 1BOB
1INP 1CYX 1XSM 1BIA 1CPT 1OPR 1PLQ 1AUQ 1KIT 1CTN
1DHR 1OBR 1RCB 1A26 1CIY 1GPC 1PFO 1GRJ 1BY9 1MAZ
1LBA 1KTE 1AM2 1BB9 1HTP 1BIX 1TUL 1DRW 1AQE 2GSQ
1DHS 6EAU 1CFR 1GBN 1BR9 1ACC 1YGS 1B5L 1IAM 1A32
1RMD 1PEA 1SEK 1KLO 1OXA 1CRB 2TCT 1ESC 1TFR 2NG1
1LCI 1PHT 1ALY 1VIN 1A6Q 1A76 1A1X 1CSN 1TIG 1A8H
1BTN 1CDY 1CFB 1MSC 1AMX 1HOE 1UOX 2PGD 1BV1 2PLC
4MT2 1SRA 1DDT 1NSJ 1UOK 1POC 1SUR 1GOX 1GSA 1MJC
2PIA 1LKI 1BY2 1SKF 1BIF 1PBV 1ALO 1RMG 2I1B 1DPB
1AJ2 4PAH 1FCB 1PNB 1BF2 1AZ9 1A63 3TDT 7TAA 1OPC
1PTQ 1BBA 1PUC 1FUA 1RSS 1BCL 1SKZ 1NEU 1ALU 1CUK
1CA1 1MAI 1AD2 1OPY 1BDT 1BHE 1JDW 1PHM 1DXY 1VOM
1CEO 1A8L 1TMY 1SVB 1AIL 1WHO 1JDC 1SFP 3TSS 1DUN
1IOW 1PBB 1GPR 1A48 4BNL 2PII 3GCB 1BG7 1VLS 1PUD
2ABK 1MDL 1RKD 1EUR 1DMR 1GND 1UCH 1BG2 1AKO 1UXY
2GAR 1LCL 1MML 1POT 1QNF 1NPK 1AYL 1TIF 1BD8 1BDO
1BG6 1C3D 1HYP 2POR 1UAE 1BJ7 1TML 1TYV 1HCL 2SAK
1FNA 1AL3 2TGI 2ACY 1LST 1LBU 1AMP 1NAR 1FAS 2CBP
1FMB 1AXN 1TUD 1PDA 1HA1 1CV8 1CHD 1AMF 1USH 1CPQ
1BM8 1XWL 1BGC 1AJJ 1TFE 1NKR 1IDO 1VJS 1BHP 1WAB
1VIE 1VHH 1GCA 1PDO 1FDR 1PMI 1SBP 1GOF 1AKO 1MOF
2GDM 1FXD 1FNC 1GAI 2HFT 1OSA 1VNS 3CHY 1ERV 1DHN
1AQB 1CNV 119L 1CEM 1CXC 1VCC 1GVP 2DRI 1MBA 1A3C
1EDG 1PHF 16PK 451C 1B6A 1BKF 1RZL 5NUL 1AOP 1A8E
1CVL 1ARV 1NIF 3CYR 1MRJ 1ZIN 1LAM 1CSH 1KUH 1PTF
1BFG 1BFD 3PTE 2AYH 2MYR 1NOX 1AKR 2A0B 1A8D 1MOQ
1HFC 1RA9 1TCA 3GRS 2CBA 1KPF 3ICB 1AIE 1KOE 1WHI
1RIE 1MLA 1HKA 1OPD 1FLP 2MCM 1CYO 1POA 1BRT 2HBG
2SNS 1XNB 2RN2 3SBB 1BGF 2END 1YGE 3VUB 2CTC 1HMT
1PPT 1BQK 1UTG 1PLC 1BKO 1DCS 1C52 7RSA 1OAA 1MSI
1YCC 2PTH 2SN3 1AMM 1BX7 1ATG 2KNT 1MUN 1A7S 1CTJ
1BS9 2IGD 1NKD 3SIL 2ERL 1A6M 1CEX 1IXH 1BYI 1AHO
1NLS 2PDN 3LZT 1RB9 3PYP 1CBN 1GCI
















Supplementary Table II. PDB codes for 213 testing proteins 



1AF5 1FHE 1CRY 1ADT 2CHR 2LDX 2UCZ 1BIA 1A1S 1B5M 
1PEX 3PHV 1OJT 2FHI 1ILE 1FBL 1GWZ 1BIK 1PMT 1A06 
4FXC 2GPR 1PBK 2ABL 1TMO 1CYG 2ALR 1NAT 1FGS 1AC5 
1HIB 1QPG 1CQA 1DOT 1MKP 1AD6 1BQG 1DHY 1A45 1DIK 
1CIU 3KAR 1AAO 1GDD 1PHK 1VIP 1HAR 1TSY 1AW9 2DAP 
1ECY 2ASI 1ANU 1CBG 1FAJ 1P38 1BLU 1A80 1A8Z 1KIV 
1MNC 1A3K 1XAA 1INR 1RGP 1AQ1 1TN3 2E2C 1GSH 1A8P 
1FIL 1MN1 1RCI 1RIS 1HVF 1ACF 1BDB 1POH 1ESL 1HCZ 
1A58 1AEW 1BAM 1SNP 1KVY 1CYI 1UKZ 1DOI 1BGP 1TN4 
1CYJ 3PRN 1ENP 1LML 1DYR 1OBM 1A44 1ZRN 1AHR 1XND 
1RDS 1A68 1VQB 1ALQ 1MRG 1A3D 1GZI 1A8S 1AVC 1LDB 
1AR2 1BMP 1ABQ 1TLK 1ULA 6FIT 1A43 2TPT 1YFM 1LU1 
1CPY 1HJP 1BAG 1HUP 1CYW 1JSG 1BMG 1FSU 2CND 1SBF 
2TDX 1KAS 1A6I 4P2P 1SZT 1DOL 1ANN 8OHM 1ASS 1JNK 
1BET 8CHO 2ASR 1APA 2PK4 1HEY 1BYT 1ODD 1ZXQ 1FTS 
1ENY 1AVK 1YVS 1BFS 5PNT 1AOK 1AIR 1A60 1BCG 1A6L 
3ERK 1BED 1BKL 1ENH 1AOB 2PSR 1BB6 1LPP 1BK2 3KVT 
1NFO 1AZ6 1PVL 1BKM 1VPE 1MHO 1BK1 1AE7 1AYI 1NHP 
1IAG 1CGT 1AA2 2EBN 1BK9 1AK2 3GAR 1EIF 1CLC 4RHN 
1DFX 1A8B 1YMV 1AXO 1HFX 1LRA 1RCY 1NUC 1JUG 2HTS 
1RFS 1AMK 1IVD 1IFT 1NFN 1ARS 1A7E 2SLI 1GBS 3KIV 
1BZA 1AT5 2ERA 






Figure Captions 

Figure 1: Plot of the optimal e parameters versus those determined using a statistical scheme, 
using a training set of 387 proteins. 

Figure 2: Results of the threading tests for 213 proteins arranged according to their length, 
N. Only the failed cases are shown. The top panel shows a plot of the number of decoys 
that performed better than the native state structure versus N whereas the bottom panel 
shows a similar plot for the couplings that were determined statistically. Note the disparity 
in the scales of the y-axes. 

Figure 3: The top three contributions to e(i,m) emerging from the SVD analysis. The 
numbers in the ovals indicate the mode number. The letters at the top left of each segment 
of two panels indicate amino acids (in the single letter code) for which this particular mode 
is dominant. The top panels in each segment show the modes - the values of f^-^ for the 
nine values of the environmental variable m. For each kind of secondary structure, the 
environments are listed in order from small to large exposure. The bottom panels show 
the amino-acid-dependent weights y(„) with which the displayed mode contributes to the 
score in a given environment. 

Figure 4: Plot of the zero "temperature" free score and the native state score of each of the 
proteins in the training and test sets. 
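
The quantity plotted in Figure 2, the number of decoys that perform better than the native structure, can be illustrated with a minimal sketch. It is not the authors' code. It assumes gapless threading (a sequence of length N is slid along every window of length N in longer host structures), integer labels for the environment classes m, and that a lower score is better; the names `thread_score`, `count_better_decoys`, and `toy_e`, and all numerical values, are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical lookup: (amino acid letter, environment class m) -> score e(i, m).
ScoreTable = Dict[Tuple[str, int], float]


def thread_score(seq: str, envs: Sequence[int], e: ScoreTable) -> float:
    """Sum e(i, m) over all positions of one sequence-to-structure mounting."""
    if len(seq) != len(envs):
        raise ValueError("sequence and environment string must have equal length")
    return sum(e[(aa, m)] for aa, m in zip(seq, envs))


def count_better_decoys(
    seq: str,
    native_envs: Sequence[int],
    hosts: List[Sequence[int]],
    e: ScoreTable,
) -> int:
    """Count gapless decoys that score strictly lower than the native state.

    Assumption: lower score means better. Each host is the environment string
    of another structure; every window of length len(seq) is one decoy.
    """
    n = len(seq)
    native = thread_score(seq, native_envs, e)
    better = 0
    for host in hosts:
        # Hosts shorter than the sequence yield an empty range and no decoys.
        for start in range(len(host) - n + 1):
            if thread_score(seq, host[start:start + n], e) < native:
                better += 1
    return better


# Toy values, invented for illustration only (two residue types, two classes).
toy_e = {("A", 0): -1.0, ("A", 1): 1.0, ("B", 0): 1.0, ("B", 1): -1.0}

# Native mounting is optimal, so no decoy beats it.
print(count_better_decoys("AB", [0, 1], [[1, 0, 1, 0]], toy_e))  # prints 0
# Native mounting is the worst possible, so both decoys beat it.
print(count_better_decoys("AB", [1, 0], [[0, 1, 1]], toy_e))  # prints 2
```

A threading test counts as a success in this sketch when the function returns 0; only nonzero counts would appear as points in a plot like Figure 2.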






FIGURES 

FIG. 1. [Plot not reproduced.] 

FIG. 2. [Plot not reproduced. Horizontal axis: N.] 

FIG. 3. [Plot not reproduced. The amino acids listed at the top left of the three segments are C F I V H S T N P Q E R K; W M Y D; and L A.] 

FIG. 4. [Plot not reproduced.] 



